*City names in Acfrs database are not obvious, i.e., does not contain a certain word to indicate it’s a city
*The census population data use GEOIDs, which are “numeric codes that uniquely identify all administrative/legal and statistical geographic areas for which the Census Bureau tabulates data”.
*ACFRs data has government_id (which is called census_id in the ACFRs portal), but not geo_id.
*Use a “middle file” to link geo_id and government_ID
*Need to a) identify city and towns in census population data and b) link those to corresponding acfrs entities.
List of cities and towns with geo_id.
#Marc emailed this data Feb2 2022: using 017 Census master file data to map to the 2020 Census Population data using FIPS Codes. GEO_ID: Census identifiers that point to population of places in US
geoID_pop20_census <- rio::import(here::here("data", "City and Town Mapping.xlsx"), sheet = 2) %>%
rename(population = `!!Total:`,
geo_id = GEO_ID)
# join with df_state to get state.abb and state.name --> to have a common col with acfrs to join later
df_state <- data.frame(state.abb, state.name) %>%
add_row(state.abb = "DC", state.name = "District of Columbia")
pop <- geoID_pop20_census %>%
separate(NAME, c("city_town", "state.name"), sep = ",") %>%
mutate(state.name = str_trim(state.name),
city_town = str_trim(city_town)) %>%
left_join(df_state) %>% drop_na()
datatable(pop)
There are 31615 observations in this data set.
ACFRs database uses government ID (enumeration ID of local governments), but do not have geo_id.
We use a Census master file data to map ACFRs to the 2020 Census Population data by FIPS Codes.
#government_ID (used in ACFRs): enumeration ID of local governments. ACFRs data does not use FIP codes
# Note that population data is ACFRs portal is 2017 -> don't use
# Note: census_id, which means government_id in Acfrs are numeric --> can be turn to scientific notation in R --> will not match --> need to avoid scientific notation.
# This has both government_ID and geo_id
governmentID <- rio::import(here::here("data", "City and Town Mapping.xlsx"), sheet = 3) %>%
select(government_ID, `INFERRED GEO_ID`, NAME, CITY, STATE_AB, COUNTY_AREA_NAME) %>%
rename(geo_id = `INFERRED GEO_ID`, # Marc created INFERRED GEO_ID, which meant to be geo_id
original_name = NAME)
# Use governmentID to join with population
pop_governmentID <- pop %>%
left_join(governmentID) %>% drop_na(government_ID)
datatable(pop_governmentID %>% select(geo_id, government_ID, city_town, population))
*Get all general purpose entities in acfrs
*Join above acfrs entities with dataset of city and town population data that has governmentID
# census_id in ACFRs is government_ID used in file "City and Town Mapping.xlsx"
# 1. get all general purpose entities in acfrs
acfrs_governmentID <- readRDS("data/data_from_dbsite_2020.RDS") %>%
filter(category == "General Purpose") %>%
rename(government_ID = census_id) %>%
# problem: government_ID in pop_governmentID has 14 characters (05201902700000) >< some government_ID in ACFRs has 13 characters 5201902700000 --> Marc: Yes, it is safe to assume that these government_ID in ACFRs data should have 0 at the beginning.
mutate(government_ID = ifelse(str_length(government_ID) < 14, paste0("0", government_ID), government_ID))
# 2. join with city and town population data that has governmentID
acfrs_city_pop <- acfrs_governmentID %>%
left_join(pop_governmentID) %>%
select(-c(state, STATE_AB, year)) %>%
arrange(desc(population)) %>%
drop_na(population)
datatable(acfrs_city_pop)
At this point Acfrs now has 11682 general purpose entities FY 2020, of which 6810 are cities and towns.
Check if there’s any overlapping between the above lists
# all general purpose entities: 11187
length(unique(acfrs_governmentID$government_ID))
## [1] 11187
unique(acfrs_governmentID$government_ID) -> t1
# cities and towns: 6810
length(unique(acfrs_city_pop$government_ID))
## [1] 6810
unique(acfrs_city_pop$government_ID) ->t2
# gap bw the 2 above: 4377
length(unique(setdiff(t1, t2)))
## [1] 4377
# list of counties (in count_gov.Rmd): 2569
county_pop_census_acfrs <- readRDS("county_pop_census_acfrs.RDS")
length(unique(county_pop_census_acfrs$census_id))
## [1] 2530
unique(county_pop_census_acfrs$census_id) ->t3
# entities overlapping between city list and county list
intersect(t2, t3) -> intersect_city_county
# view the overlapping cities
acfrs_city_pop %>% filter(government_ID %in% intersect_city_county) %>%
select(name, government_ID)
## name government_ID
## 1 Macon-Bibb County 11201100100000
## 2 Marion County 15202700500000
## 3 Kankakee County 14204601000000
## 4 Adrian (Lenawee County) 23204600200000
## 5 Anaconda-Deer Lodge County 27201200100000
## 6 Shenandoah County 47207000200000
## 7 Bronson (Branch County) 23201200100000
## 8 Amherst County 47200500100000
## 9 Bangor (Van Buren County) 23208000100000
## 10 Louisa County 47205500100000
## 11 Appomattox County 47200600100000
## 12 Bessemer (Gogebic County) 23202700100000
## 13 Armada (Macomb County) 23205000100000
## 14 Birch Run (Saginaw County) 23207300100000
## 15 North Sherman County Rural Fire Protection District 38202900100000
## 16 Buchanan County 47201200100000
## 17 Beaverton (Gladwin County) 23202600100000
## 18 Halifax County 47204200200000
## 19 Baroda (Berrien County) 23201100100000
## 20 Benzonia (Benzie County) 23201000100000
## 21 Floyd County 47203200100000
## 22 Casnovia (Muskegon County) 23206100100000
## 23 Surry County 47209100300000
## 24 Madison County 47205700100000
## 25 Washington County 47207900100000
Those whose names contain parentheses are actually cities. The part in side the parentheses indicates which county they belong to.
state_gov <-readRDS("state_gov.RDS")
# composition of city, county, state gov in general purpose
# acfrs_governmentID %>%
# left_join(pop_governmentID) %>% #drop_na(population) %>%
# select(-c(state, STATE_AB, year)) %>%
# arrange(desc(population)) %>%
# take out county 2569
# filter(!government_ID %in% county_pop_census_acfrs$census_id) %>%
# filter(!id %in% state_gov$id)
Lastly, take 15 who are actually counties out of the cities list.
# actually county
acfrs_city_pop %>% filter(government_ID %in% intersect_city_county) %>%
filter(!grepl("\\)", name)) -> actual_county
acfrs_city_pop_final <- acfrs_city_pop %>%
filter(!government_ID %in% actual_county$government_ID)
datatable(acfrs_city_pop_final)
saveRDS(acfrs_city_pop_final, "acfrs_city_pop.RDS")
The list of city and county contains 6795 entities.